49 research outputs found

    Short reasons for long vectors in HPC CPUs: a study based on RISC-V

    For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially available SIMD units process up to 8 double-precision elements with one instruction. The optimal vector width, and the impact of memory latency and bandwidth on CPU throughput, remain challenging research areas. This study examines the behavior of four computational kernels on a RISC-V core connected to a customizable vector unit capable of operating on up to 256 double-precision elements per instruction. The four codes have been purposefully selected to represent non-dense workloads: SpMV, BFS, PageRank, and FFT. The experimental setup allows us to measure their performance while varying the vector length, the memory latency, and the memory bandwidth. Our results not only show that larger vector lengths allow for better tolerance of limitations in the memory subsystem, but also offer hope to code developers beyond dense linear algebra.
    Comment: Accepted as a paper at the Second RISC-V Workshop at SC23 - Denver
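The irregular access pattern these kernels share is easiest to see in sparse matrix-vector multiply: the inner loop gathers from `x` through `col_idx`, and it is exactly this indexed (gather) access that long vector units can keep in flight to hide memory latency. A minimal CSR SpMV sketch in Python; the 3x3 matrix and all names are illustrative, not taken from the paper:

```python
# Minimal CSR sparse matrix-vector multiply (SpMV). The gather on
# col_idx is the irregular access that long vector units can overlap
# with outstanding memory requests.

def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for A stored in compressed sparse row form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # indexed gather from x
        y[row] = acc
    return y

# 3x3 example matrix: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

With a vector ISA, the inner loop becomes a vector gather plus a multiply-accumulate, so a longer vector length means more independent loads outstanding at once.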

    Software Development Vehicles to enable extended and early co-design: a RISC-V and HPC case study

    Prototyping HPC systems with low-to-mid technology readiness level (TRL) components is critical for providing feedback to hardware designers, the system software team (e.g., compiler developers), and early adopters from the scientific community. The typical approach to hardware design and HPC system prototyping often limits feedback or only allows it at a late stage. In this paper, we present a set of tools for co-designing HPC systems, called software development vehicles (SDV). We use an innovative RISC-V design as a demonstrator, which includes a scalar CPU and a vector processing unit capable of operating on large vectors of up to 16 kbits. We provide an incremental methodology and early tangible evidence of the co-design process that provides feedback to improve both the architecture and the system software at a very early stage of system development.
    Comment: Presented at the "First International workshop on RISC-V for HPC", co-located with ISC23 in Hamburg

    Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using an Astrophysics application

    New challenges in Astronomy and Astrophysics (AA) are urging the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and of the data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing a profound phase of innovation, in which the primary challenge on the road to "Exascale" is power consumption. The goal of this work is to provide insight into the performance and energy footprint of contemporary architectures for a real astrophysical application in an HPC context. We use a state-of-the-art N-body application that we re-engineered and optimized to fully exploit the heterogeneous underlying hardware. We quantitatively evaluate the impact of computation on energy consumption when running on four different platforms. Two of them represent current HPC systems (Intel-based and equipped with NVIDIA GPUs), one is a micro-cluster based on ARM MPSoCs, and one is a "prototype towards Exascale" equipped with ARM MPSoCs tightly coupled with FPGAs. We investigate the behavior of the different devices: the high-end GPUs excel in terms of time-to-solution, while MPSoC-FPGA systems outperform GPUs in power consumption. Our experience suggests that considering FPGAs for computationally intensive applications is very promising, as their performance is improving to meet the requirements of scientific applications. This work can serve as a reference for the development of future platforms for astrophysics applications where computationally intensive calculations are required.
    Comment: 15 pages, 4 figures, 3 tables; Preprint (V2) submitted to MDPI (Special Issue: Energy-Efficient Computing on Parallel Architectures)

    A general synthetic route for the preparation of high-spin molecules: Replacement of bridging hydroxo ligands in molecular clusters by end-on azido ligands

    Abstract: A general method of increasing the ground-state total spin value of a polynuclear 3d-metal complex is illustrated through selected examples from cobalt(II) and nickel(II) cluster chemistry, involving the dianion of the gem-diol form of di-2-pyridyl ketone and carboxylate ions as organic ligands. The approach is based on the replacement of hydroxo bridges, which most often propagate antiferromagnetic exchange interactions, by the end-on azido ligand, which is a ferromagnetic coupler.

    An innovative low-cost Classification Scheme for combined multi-Gigabit IP and Ethernet Networks

    Abstract — IP is certainly the most popular wide-area network protocol, while Ethernet is the most common Layer-2 network protocol, and it is currently being deployed beyond the tight borders of LANs. In order to accommodate the needs of MANs and WANs, several QoS mechanisms employed either at the IP layer or the MAC sublayer have been proposed. These QoS mechanisms require the identification of network flows and the classification of network packets according to certain packet header fields. In this paper, we propose a classification engine that can be employed either at the MAC sublayer or the IP layer; it is the successor of an already successfully implemented scheme that operates only at the MAC sublayer. The new scheme uses an innovative hashing scheme combined with an efficient trie-based structure. These techniques support extremely high-speed decisions, at rates of more than 100 Gb/s, while the memory needs of the proposed engine are significantly lower than those of similar schemes currently in use. The engine has been implemented in hardware, occupying less than 0.2 mm² in a state-of-the-art CMOS technology. As a result, the proposed scheme is a very promising candidate both for next-generation IP classification engines (probably incorporated within high-end network processors) and for Ethernet equipment that needs to support classification at multi-gigabit-per-second network speeds while employing a minimum amount of memory.
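The engine itself is a hardware design, but the trie side of the idea can be illustrated in software: a binary trie over header bits that returns the action of the longest matching prefix, which is the core operation of flow classification. Everything below (class name, prefixes, flow labels) is an invented illustration, not the paper's actual structure:

```python
# Sketch of trie-based longest-prefix matching over header bits,
# the software analogue of the trie structure used for packet
# classification. Prefixes and actions are illustrative only.

class PrefixTrie:
    def __init__(self):
        self.root = {}

    def insert(self, prefix_bits, action):
        """Store an action under a bit-string prefix, e.g. '1011'."""
        node = self.root
        for b in prefix_bits:
            node = node.setdefault(b, {})
        node["action"] = action

    def lookup(self, addr_bits):
        """Walk the trie, remembering the longest matching prefix."""
        node, best = self.root, None
        for b in addr_bits:
            if "action" in node:
                best = node["action"]
            if b not in node:
                break
            node = node[b]
        else:
            if "action" in node:
                best = node["action"]
        return best

t = PrefixTrie()
t.insert("10", "flow-A")      # matches addresses starting with 10...
t.insert("1011", "flow-B")    # a longer, more specific prefix
print(t.lookup("101100"))     # flow-B (longest match wins)
print(t.lookup("100000"))     # flow-A
```

A hardware engine would pair such a trie with a hash over the remaining header fields so that most lookups resolve in a small, fixed number of memory accesses.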

    Architectural support for software-guided reduction of energy consumption in multicore processor communication

    At the beginning of the 21st century, the processor industry made a fundamental shift towards multicore architectures, in order to address the diminishing returns in single-thread performance with increasing transistor counts, and to overcome the severe power problems of clock-frequency scaling. Semiconductor technology trends indicate that the era of power- and energy-constrained manycore architectures has now come. Technology projections show that the energy consumed by data movement and communication will dominate the corresponding budget of future computing systems; thus, unnecessary data movements will subtract a significant energy margin from computations.

    The most popular communication model for multi-core and many-core architectures is shared memory. Threads or processes that run concurrently on different cores communicate and exchange data by accessing the same global memory locations. However, accesses to off-chip memory are slow and, thus, processor designs utilize a hierarchy of faster on-chip memories to improve the speed of memory operations. Memory hierarchies today are based on two dominant schemes: (i) multi-level coherent caches, and (ii) software-managed local memories (scratchpads). Caches manage the memory hierarchy transparently, using hardware replacement policies, and communication happens implicitly, with cache-coherence protocols that provoke data transfers between caches. Scratchpad memories are controlled by the programmer or the runtime software, and communication happens explicitly, through programmable DMA engines that perform the data transfers.

    This thesis proposes architectural support in the memory hierarchy to enable the software to control data locality; we design programmable hardware primitives that allow runtime software to orchestrate communication and reduce the associated energy consumption. We demonstrate a hybrid cache/scratchpad memory hierarchy that provides unified hardware support for both implicit communication, via cache coherence, and explicit communication, via fast virtualized inter-processor communication hardware primitives. We also introduce Epoch-based Cache Management (ECM), which allows software to assign priorities to cache lines, in order to guide the cache replacement policy and, in effect, to manage locality. Moreover, we design the Explicit Bulk Prefetcher (EBP), a programmable prefetch engine that allows software to accurately prefetch data ahead of time, in order to hide memory latency and improve cache locality. Furthermore, we propose a set of hardware primitives for Software Guided Coherence (SGC) in non-cache-coherent systems, in order to allow runtime software to orchestrate the fetching of the most up-to-date version of data from the appropriate cache(s) and maintain coherence at the software-object granularity.

    We evaluate our proposed hardware primitives by comparing them against directory-based cache coherence with hardware prefetching. Our experimental results for explicit communication show that we can improve performance by 10% to 40%, and at the same time reduce the energy consumption of on-chip communication by 35% to 70%, owing to a significant reduction in on-chip traffic, by factors of 2 to 4. Moreover, we exploit a task-based programming system to guide the hardware, and show that our proposed hardware primitives for cache-coherent systems (ECM, EBP) improve performance by an average of 20%, inject 25% less on-chip traffic on average, and reduce the energy consumption in the components of the memory hierarchy by an average of 28%. Our hardware support for non-cache-coherent systems (ECM, SGC) improves performance by an average of 14%, injects 41% less on-chip traffic on average, and reduces the energy consumption in the components of the memory hierarchy by an average of 44%.
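The priority idea behind ECM can be sketched as a toy replacement policy: software tags each cache line with a priority, and eviction removes the least-recently-used line among those with the lowest priority. This is a loose Python model under assumed semantics, not the thesis's hardware design:

```python
# Toy model of priority-guided cache replacement in the spirit of ECM:
# software assigns a priority per line; the victim is the oldest line
# among those with the lowest priority. Illustrative assumptions only.

from collections import OrderedDict

class PriorityCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # addr -> priority; order = recency

    def access(self, addr, priority=0):
        if addr in self.lines:
            self.lines.move_to_end(addr)   # refresh recency on a hit
            return True
        if len(self.lines) >= self.capacity:
            # min() is stable: the first (oldest) line with the lowest
            # priority is chosen as the victim.
            victim = min(self.lines, key=lambda a: self.lines[a])
            del self.lines[victim]
        self.lines[addr] = priority
        return False                       # miss

c = PriorityCache(2)
c.access("A", priority=5)   # high priority: software wants this resident
c.access("B", priority=0)
c.access("C", priority=0)   # evicts B, not the high-priority line A
print(sorted(c.lines))      # ['A', 'C']
```

A plain LRU cache would have evicted A here; the priority tag is what lets software keep hot data resident across an epoch.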

    Optimization and Bottleneck Analysis of Network Block I/O in Commodity Storage Systems

    Building commodity networked storage systems is an important architectural trend. Commodity servers hosting a moderate number of consumer-grade disks and interconnected with a high-performance network are an attractive option for improving storage system scalability and cost-efficiency. However, such systems incur significant overheads and are unable to deliver the available throughput to applications. We examine in detail the sources of overhead in such systems, using a working prototype to quantify the overheads associated with various parts of the I/O protocol. We optimize our base protocol to deal with small requests by batching them at the network level, without any I/O-specific knowledge. We also redesign our protocol stack to allow for asynchronous event processing, in-line, during send-path request processing. These techniques improve performance for an 8-disk SATA RAID-0 array from 200 to 290 MBytes/s (a 45% improvement). Using a ramdisk, peak performance improves from 320 to 474 MBytes/s (a 48% improvement), which is 72% of the maximum possible throughput in our experimental setup. We also analyze the remaining system bottlenecks, and find that although commodity storage systems have the potential for building high-performance I/O subsystems, traditional network and I/O protocols are not fully capable of delivering this potential.
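The network-level batching idea can be sketched independently of any I/O protocol: small requests are coalesced until a size threshold is reached, amortizing the per-message overhead over many requests. The threshold and the byte-string requests below are illustrative assumptions, not the prototype's parameters:

```python
# Sketch of request batching at the network level: coalesce small
# requests into batches so each batch stays under a byte budget.
# Threshold and request encoding are illustrative assumptions.

def batch_requests(requests, max_batch_bytes=4096):
    """Group requests into batches of at most max_batch_bytes each."""
    batches, current, size = [], [], 0
    for req in requests:
        if current and size + len(req) > max_batch_bytes:
            batches.append(current)        # flush the full batch
            current, size = [], 0
        current.append(req)
        size += len(req)
    if current:
        batches.append(current)            # flush the final partial batch
    return batches

reqs = [b"x" * 1500] * 5                   # five 1500-byte requests
batches = batch_requests(reqs)
print([len(b) for b in batches])           # [2, 2, 1]: 3000B fits, 4500B does not
```

Note the batching logic needs no knowledge of the I/O semantics of each request, which is exactly why it can live at the network layer.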

    A Memory-Efficient Reconfigurable Aho-Corasick FSM Implementation for Intrusion Detection Systems

    Abstract — The Aho-Corasick (AC) algorithm is a very flexible and efficient, but memory-hungry, pattern-matching algorithm that can scan for the existence of multiple query strings in a test string while looking at each character exactly once, making it one of the main options for software-based intrusion detection systems such as SNORT. We present the Split-AC algorithm, a reconfigurable variation of the AC algorithm that exploits domain-specific characteristics of intrusion detection to considerably reduce the FSM memory requirements. Split-AC achieves an overall memory reduction of 28-75% compared to the best previously proposed implementation.
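For reference, the textbook Aho-Corasick automaton that Split-AC compresses can be written compactly: build a trie of the patterns, add failure links by BFS, then scan the text once. This is the standard algorithm, not the Split-AC memory layout:

```python
# Minimal Aho-Corasick automaton: single pass over the text, each
# character examined exactly once, reporting every pattern occurrence.

from collections import deque

def build_ac(patterns):
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                      # phase 1: build the trie
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())           # phase 2: failure links (BFS)
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]            # inherit matches via failure
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = build_ac(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):             # each character read once
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return sorted(hits)

print(search("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The memory pressure the paper targets is visible here: every state carries its own transition table, which is what grows quickly for large SNORT-scale rule sets.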

    ECOSCALE: Reconfigurable Computing and Runtime System for Future Exascale Systems

    In order to reach exascale performance, current HPC systems need to be improved. Simple hardware scaling is not a feasible solution due to increasing utility costs and power-consumption limitations. Apart from improvements in implementation technology, what is needed is to refine the HPC application development flow as well as the system architecture of future HPC systems. ECOSCALE tackles these challenges by proposing a scalable programming environment and architecture, aiming to substantially reduce energy consumption as well as data traffic and latency. ECOSCALE introduces a novel heterogeneous energy-efficient hierarchical architecture, as well as a hybrid many-core + OpenCL programming environment and runtime system. The ECOSCALE approach is hierarchical and is expected to scale well by partitioning the physical system into multiple independent Workers (i.e., compute nodes). Workers are interconnected in a tree-like fashion and define a contiguous global address space that can be viewed either as a set of partitions in a Partitioned Global Address Space (PGAS), or as a set of nodes hierarchically interconnected via an MPI protocol. To further increase energy efficiency, as well as to provide resilience, the Workers employ reconfigurable accelerators mapped into the virtual address space, utilizing a dual-stage System Memory Management Unit with coherent memory access. The architecture supports shared partitioned reconfigurable resources accessed by any Worker in a PGAS partition, as well as automated hardware synthesis of these resources from an OpenCL-based programming model.